July 6, 2019

Preview

  1. Introduction to single-cell RNA-seq
  2. Quality control and normalization
  3. Survey of downstream analysis methodology

Highly variable genes

  • Goal: identify a set of genes with high variability attributable to biology (over and above technical variability)
  • Useful for vizualization, dimension reduction, clustering, marker gene selection, etc
  • Challenging without orthogonal measurement of technical variability (e.g. spikeins)
    • with spikeins: select genes with variance significantly above mean-variance trend in control genes
    • without spikeins: select genes with variance significantly above overall mean-variance trend in all endogenous genes (assumes variance of most genes is purely technical)

Highly variable genes with spikeins

Highly variable genes without spikeins

Review: Dimension reduction

  • Useful to summarize & visualize relationships between cells in low dimensional space
  • Commonly used approaches:
    • PCA (Principal Components Analysis)
    • tSNE (t-distributed Stochastic Neighbor Embedding)
    • UMAP (Uniform Manifold Approximation and Projection)
  • Clustering can be carried out on reduced dimensions, but with caution

t-distributed Stochastic Neighbor Embedding (tSNE) algorithm

  1. random initialization, then iterate 2 & 3
  2. calculate two probabilities of similiarity (of being neighbors - perplexity parameter roughly governs how many neighbors to consider):
    • probability of similarity of points in high-dimensional space
    • probability of similarity of points in low-dimensional space
  3. minimize difference between these probabilities using Kullback-Leibler (KL) divergence
    • KL divergence is a distance metric for distributions: \[ D_{KL}(P|Q) = \sum_i p(x_i) log(\frac{p(x_i)}{q(x_1)}) \]

Uniform Manifold Approximation and Projection (UMAP)

  • Becht et al 2018 (https://doi.org/10.1038/nbt.4314)
  • UMAP attempts to map points to a global coordinate system that preserves local structure
  • Similar conceptually as tSNE, but specifics are different (e.g. different distributional kernels)
  • Comparable performance to tSNE, but slightly better at preserving distances and faster runtime

PCA vs nonlinear methods

Things slow down with large numbers of cells

General purpose clustering

  • Hierarchical: build and cut dendrogram based on pairwise distance matrix
  • K-means: iteratively assign cells to nearest cluster center
  • Density-based: clusters are defined as areas of higher density (e.g. DBSCAN)
  • Graph-based: assumes data can be represented as a graph structure; doesn’t require estimating pairwise distance matrix (e.g. SNN-Cliq)

drawingdrawingdrawing

Clustering for single-cell

  • Goal: automatically identify subpopulations of cell types/states
  • Many methods adapted/developed for single-cell to account for:
    • high dropout rate
    • batch effects
    • high dimensionality

Andrews & Hemburg 2018 (https://doi.org/10.1016/j.mam.2017.07.002)

Evaluation of clustering results

  • Many clustering algorithms require specification of the number of clusters (e.g. K in K-means)
  • Different algorithms may provide vastly different clusterings

Risso et al 2018 (https://doi.org/10.1371/journal.pcbi.1006378)

Metrics for evaluating clustering results

  • Single clustering:
    • Average silhouette: how similar a cell is to its own cluster (average distance within) compared to other clusters (average distance between); not efficient for large numbers of cells \[s(b) = \frac{\bar{d}_{\text{within}}(b) - \bar{d}_{\text{between}}(b)}{max(\bar{d}_{\text{within}}(b), \bar{d}_{\text{between}}(b))}\]
    • Modularity score: difference between number of within-cluster edges to the expected number under the null (random edges)
  • Comparing two clusterings
    • Adjusted rand index: similarity measure between two clusterings based on number of cells grouped accordingly and adjusted for random groupings

Benchmark of clustering methods for single-cell

Review: Differential expression

  • Traditional differential expression for bulk RNA-seq aims at detecting shifts in mean (fold change)
  • Read counts \(y_{bg}\) typically modeled with a two parameter distribution (e.g. mean \(\mu_{bg}\), dispersion \(\alpha_g\))
  • Recall DESeq2 model given size factors \(s_b\): \[ y_{bg} \sim NB(\mu_{bg}, \alpha_g)\] \[\mu_{bg}=s_b q_{bg}\] \[ log_2(q_{bg}) = x_{b.}\beta_g \]

Bulk RNA-seq measures averages

Heterogeneity hidden in bulk RNA-seq

Distributions of single-cell read counts

Differential distributions

  • scDD: Korthauer et al 2016 (https://doi.org/10.1186/s13059-016-1077-y)
  • Key idea: subpopulations of cell type/state will manifest as multimodal expression distributions
  • Approach: explicitly model within sample heterogeneity and identify genes with differential distributions (as opposed to differences in mean)

scDD framework

  1. Test for global difference in distribution of expressed values with nonparametric two-sample Kolmogorov-Smirnov test: \[ D_{n,m} = sup_x |F_{1,n}(x) - F_{2,m}(x)|\] \[ \text{Reject at level } \alpha \text{ if } D_{n,m} > \sqrt{\frac{-(n+m)ln(\alpha)}{2mn}} \]
  2. Test for difference in dropout: GLM adjusted for cellular detection rate
  3. Classify significant genes into summary patterns:
  • Fit \(K-\) component normal mixture models to normalized log transformed expression values within each condition
  • Select \(K\) by BIC
  • Select pattern based on number of components & their equality of means

scDD detects more subtle and complex changes

DE after clustering

  • A common workflow will include detection of DE genes between discovered subtypes (clusters)
  • Caution must be used in the interpretation of p-values here
  • Since clusters are formed by maximizing differences in expression, it is not surprising to discover DE genes between clusters - we can even detect changes in null data
  • We may only be interested in differences in average expression here

Alternative approach for DE in single cell

  • MAST: Finak et al. 2015 (http://doi.org/10.1186/s13059-015-0844-5)
  • Hurdle model that depends on whether gene \(g\) in cell \(b\) is expressed (Normalized log count \(Y_{bg} > 0\)) or not (\(Y_{bg} = 0\)): \[ logit(P(Y_{bg} = 0)) = X_b\beta^D_g\] \[ P(Y_{bg} | Y_{bg} > 0) \sim N(X_b\beta_g^C, \sigma_g^2) \]
  • Obtain \(\chi_{n}^{2}\) and \(\chi_{m}^{2}\) likelihood ratio test statistics for \(\beta_g^D\) and \(\beta_g^C\), with \(n\) and \(m\) degrees of freedom, respectively
  • Global test statistic for a change in either component is the sum of the two:
    • sum of \(\chi^2\) random variables yields another \(\chi^2\) with degrees of freedom equal to the sum of degrees of freedom of the two original variables: \[\chi_m^{2} + \chi_n^{2} \sim \chi_{n+m}^{2}\]

Many other approaches for DE analysis

Scaling up to millions of cells with Bioconductor

  • Some datasets may be so large that they do not fit into memory efficiently (or at all)
  • Handy infrastructure for computing on data that is saved on disk - allows computation on subsets:
    • HDF5 - file format interfaced with read/write functionalities in Bioconductor packages (rhdf5, Rhdf5lib, beachmat)
    • DelayedArray, DelayedMatrixStats - Bioconductor packages that provide an ecosystem for storing and computing on disk-backed files just like you would ordinary matrices
  • Configured for use in SingleCellExperiment and interfaces automatically with scater and scran
    • requires saving data as HDF5 format

HDF5 in action - 1.3 million cells

library(TENxBrainData)
assay(TENxBrainData(), "counts", withDimnames = FALSE)
## <27998 x 1306127> HDF5Matrix object of type "integer":
##                [,1]       [,2]       [,3]       [,4] ... [,1306124]
##     [1,]          0          0          0          0   .          0
##     [2,]          0          0          0          0   .          0
##     [3,]          0          0          0          0   .          0
##     [4,]          0          0          0          0   .          0
##     [5,]          0          0          0          0   .          0
##      ...          .          .          .          .   .          .
## [27994,]          0          0          0          0   .          0
## [27995,]          1          0          0          2   .          0
## [27996,]          0          0          0          0   .          0
## [27997,]          0          0          0          0   .          0
## [27998,]          0          0          0          0   .          0
##          [,1306125] [,1306126] [,1306127]
##     [1,]          0          0          0
##     [2,]          0          0          0
##     [3,]          0          0          0
##     [4,]          0          0          0
##     [5,]          0          0          0
##      ...          .          .          .
## [27994,]          0          0          0
## [27995,]          1          0          0
## [27996,]          0          0          0
## [27997,]          0          0          0
## [27998,]          0          0          0

Resources for handling large data

R tools for downstream analysis

  • scran: detection of highly variable genes, dimension reduction, clustering (Bioconductor)
  • Seurat: detection of highly variable genes, dimension reduction, clustering, DE (CRAN)
  • mbkmeans: minibatch K-means for large data (Bioconductor)
  • scDD: differential distributions (Bioconductor)
  • MAST: differential expression (Bioconductor)

drawingdrawing

There are many more tools I didn’t mention…

Growing number of computational tools

Curated list of tools from Sean

Install packages we’ll use in the second set of exercises

BiocManager::install("scDD")
BiocManager::install("MAST")